Abstract
Accurate whole-heart segmentation is a critical component in the precise diagnosis and interventional planning of cardiovascular diseases. Integrating complementary information from modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) can significantly enhance segmentation accuracy and robustness. However, existing multi-modal segmentation methods face several limitations: severe spatial inconsistency between modalities hinders effective feature fusion; fusion strategies are often static and lack adaptability; and feature alignment and segmentation are decoupled, which makes the overall pipeline inefficient. To address these challenges, we propose RL-U$^2$Net, a dual-branch U-Net architecture enhanced by reinforcement learning for feature alignment and designed for precise and efficient multi-modal 3D whole-heart segmentation. The model employs a dual-branch U-shaped network to process CT and MRI patches in parallel and introduces a novel RL-XAlign module between the encoders. Within this module, a cross-modal attention mechanism captures semantic correspondences between modalities, while a reinforcement-learning agent learns an optimal rotation strategy that consistently aligns anatomical pose and texture features. The aligned features are then reconstructed through their respective decoders. Finally, an ensemble-learning-based decision module integrates the predictions from individual patches to produce the final segmentation result. Experimental results on the publicly available MM-WHS 2017 dataset demonstrate that the proposed RL-U$^2$Net outperforms existing state-of-the-art methods, achieving Dice coefficients of 93.1% on CT and 87.0% on MRI, thereby validating the effectiveness of the proposed approach.
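For readers unfamiliar with the reported metric, the Dice coefficient between a predicted mask $P$ and a reference mask $G$ is $2|P \cap G| / (|P| + |G|)$. The following is a minimal NumPy sketch of that definition on a toy 3D volume. The function names `dice_coefficient` and `mean_dice`, the convention of returning 1.0 when both masks are empty, and the per-label averaging are illustrative assumptions; the abstract does not state how the reported scores were aggregated across cardiac substructures.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice overlap between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)


def mean_dice(pred_labels, target_labels, labels):
    """Average Dice over a list of integer labels (assumed aggregation)."""
    scores = [
        dice_coefficient(pred_labels == label, target_labels == label)
        for label in labels
    ]
    return float(np.mean(scores))


# Toy 3D label volumes: label 1 marks a hypothetical structure.
pred = np.zeros((4, 4, 4), dtype=int)
target = np.zeros((4, 4, 4), dtype=int)
pred[1:3, 1:3, 1:3] = 1    # 8 predicted voxels
target[1:3, 1:3, 1:4] = 1  # 12 reference voxels, 8 of them shared

score = dice_coefficient(pred == 1, target == 1)
print(f"Dice: {score:.3f}")  # 2 * 8 / (8 + 12) = 0.800
```

In this toy case the prediction covers two thirds of the reference structure with no false positives, which gives a Dice value of 0.8.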